Ch7: Moving beyond linearity
Philip Lin
2015.1.7
Outline
- Spline approach
- Natural cubic spline
- Regression spline
- Smoothing spline
- Example: B-spline
- Example: Principal Component Analysis
- Local polynomial approach
- Curse of dimensionality
Natural Cubic Spline

- How do we find a good-looking curve that interpolates all points (x_i, y_i)?
Mathematicians say: find the curve that minimizes its second derivative over the whole range of x,
$$J(f) = \int_{x_1}^{x_m} f''(x)^2\,dx.$$
Of all functions that are continuous on [x_1, x_m], have absolutely continuous first derivatives, and interpolate {x_i, y_i}, the natural cubic spline g is the smoothest in the sense of minimizing J(f).
Natural Cubic Spline

- Definition (natural cubic spline)
Consider {(x_i, y_i) : i = 1, ..., m} with x_i < x_{i+1}. The natural cubic spline g(x) satisfies:
- on each interval [x_i, x_{i+1}], g is a cubic polynomial;
- g interpolates the points {(x_i, y_i)}, i.e., g(x_i) = y_i;
- g and its first two derivatives are continuous everywhere, including at the interior knots;
- g''(x_1) = g''(x_m) = 0.
- An important idea: with k knots (here k = m), we use k − 1 pieces of cubic polynomial to construct the spline function.
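As a quick sanity check of this definition (sketched in Python rather than the slides' R), SciPy's `CubicSpline` with `bc_type="natural"` builds exactly such a spline; the data points below are made up for illustration:

```python
import numpy as np
from scipy.interpolate import CubicSpline

# a few sample points to interpolate (made-up data for illustration)
x = np.array([0.0, 1.0, 2.5, 4.0, 5.0])
y = np.array([1.0, 2.0, 0.5, 1.5, 1.0])

# bc_type="natural" imposes g''(x_1) = g''(x_m) = 0
g = CubicSpline(x, y, bc_type="natural")

# the spline interpolates every (x_i, y_i) ...
print(np.allclose(g(x), y))

# ... and its second derivative vanishes at both boundary knots
print(abs(g(x[0], 2)), abs(g(x[-1], 2)))
```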
Regression spline
Consider defining a cubic spline function f(x) with k knots, x1,…,xk.
Let β_j = f(x_j) and δ_j = f''(x_j). Then, for x_j ≤ x ≤ x_{j+1},
$$f(x) = a_j^-(x)\,\beta_j + a_j^+(x)\,\beta_{j+1} + c_j^-(x)\,\delta_j + c_j^+(x)\,\delta_{j+1},$$
where
$$a_j^-(x) = (x_{j+1} - x)/h_j, \qquad a_j^+(x) = (x - x_j)/h_j,$$
$$c_j^-(x) = \left[(x_{j+1} - x)^3/h_j - h_j(x_{j+1} - x)\right]/6, \qquad c_j^+(x) = \left[(x - x_j)^3/h_j - h_j(x - x_j)\right]/6,$$
$$h_j = x_{j+1} - x_j.$$
- I know, it is disgusting. Fortunately, there is another, cleaner and more beautiful representation of the spline function f:
$$f(x) = \sum_{i=1}^{k} b_i(x)\,\beta_i$$

Regression spline
$$f(x) = \sum_{i=1}^{k} b_i(x)\,\beta_i$$
- This formulation is important and frequently used in research; we call it a basis expansion.
- In a basis expansion, the number of knots k need not equal the number of observations n; k can be smaller than n. Note that when k < n, the fitted smooth curve will not interpolate all data points.
- b_i(x) is a prespecified basis function; so far we have used the cubic spline basis only.
- Any basis that is rich enough can be used to represent the true function f. It is like building the same castle out of different kinds of Lego bricks.

- Other rich bases
- Truncated power basis
- B-spline basis
- Haar wavelet basis
- Eigenbasis (principal component analysis)
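To see what a basis expansion with k < n looks like concretely, here is a minimal Python sketch using the truncated power basis from the list above (the sine truth, noise level, and knot placement are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

# truncated power basis for a cubic spline with 9 interior knots:
# 1, x, x^2, x^3, (x - xi_k)_+^3  ->  k = 13 basis functions
knots = np.linspace(0.1, 0.9, 9)
B = np.column_stack([x**0, x, x**2, x**3] +
                    [np.clip(x - k, 0, None)**3 for k in knots])

# ordinary least squares on the basis: k = 13 is far smaller than n = 200,
# so the fitted curve smooths rather than interpolates
beta, *_ = np.linalg.lstsq(B, y, rcond=None)
fhat = B @ beta
print(B.shape, np.mean((fhat - y) ** 2))
```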
Smoothing spline
Statisticians say that we only need an approximation and, in addition, we would like to control the roughness:
$$\sum_{i=1}^{n} \{y_i - f(x_i)\}^2 + \lambda \int f''(x)^2\,dx,$$
where $f(x) = \sum_{j=1}^{k} b_j(x)\,\beta_j$.
- What kind of basis should we use?
- It is a long story... each basis has its specialty; it depends on your purpose.
- Basically, we can see that each cubic spline basis function overlaps with its neighbors. If the overlap is severe, the coefficients β may be difficult to estimate.
- The choice of the tuning parameter λ is an important issue; it can be tuned by CV, GCV, AIC, etc.
- Compared with the choice of λ, the choice of knots is not that important. In the literature, people suggest using many knots in the basis expansion and controlling the roughness through the penalty λ.

Example: Smoothing spline with B-spline
$$\sum_{i=1}^{n} \{y_i - f(x_i)\}^2 + \lambda \int f''(x)^2\,dx,$$
where $f(x) = \sum_{i=1}^{k} B_i^m(x)\,\beta_i$ and, for i = 1, ..., k,
$$B_i^m(x) = \frac{x - x_i}{x_{i+m+1} - x_i} B_i^{m-1}(x) + \frac{x_{i+m+2} - x}{x_{i+m+2} - x_{i+1}} B_{i+1}^{m-1}(x),$$
$$B_i^{-1}(x) = I\{x_i \le x \le x_{i+1}\}.$$
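The recursion above can be coded directly. This Python sketch uses the standard Cox-de Boor indexing (which differs slightly from the slide's convention, but the idea is identical) and checks the partition-of-unity property on a uniform knot grid:

```python
import numpy as np

def bspline(i, m, t, x):
    """Degree-m B-spline B_{i,m}(x) on knot vector t via the
    Cox-de Boor recursion (standard indexing convention)."""
    if m == 0:
        # degree-0 basis is the indicator of one knot interval
        return 1.0 if t[i] <= x < t[i + 1] else 0.0
    left = 0.0 if t[i + m] == t[i] else \
        (x - t[i]) / (t[i + m] - t[i]) * bspline(i, m - 1, t, x)
    right = 0.0 if t[i + m + 1] == t[i + 1] else \
        (t[i + m + 1] - x) / (t[i + m + 1] - t[i + 1]) * bspline(i + 1, m - 1, t, x)
    return left + right

t = np.arange(11.0)   # uniform knots 0, 1, ..., 10
x = 4.5
# on the interior of the knot range, the degree-2 basis functions sum to one
s = sum(bspline(i, 2, t, x) for i in range(len(t) - 3))
print(s)
```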
library(mgcv)
set.seed(2)
n <- 400
dat <- gamSim(1, n = n, dist = "normal", scale = 2)  # Gu & Wahba 4 term additive model
x <- dat$x2; y <- dat$y
plot(x, y)

###
### B-spline
###
library(fda)
# quadratic B-spline basis (norder = 3) with 39 breaks -> 40 basis functions
bsp <- create.bspline.basis(norder = 3, breaks = place.knots(dat$x2, 39))
B <- getbasismatrix(x[order(x)], bsp, returnMatrix = TRUE)
D <- diff(diag(40), diff = 2)  # second-order difference penalty matrix
lambda <- 1
# penalized least squares: (B'B + lambda D'D)^{-1} B'y
a <- solve(t(B) %*% B + lambda * t(D) %*% D) %*% t(B) %*% y[order(x)]
plot(x, y)
for (i in 1:ncol(B)) lines(x[order(x)], a[i] * B[, i], type = "l")

plot(x, y)
for (i in 1:ncol(B)) lines(x[order(x)], a[i] * B[, i], type = "l")
lines(x[order(x)], B %*% a, lwd = 2, type = "l")  # fitted curve: sum of scaled bases

lambda <- 30  # heavier penalty gives a smoother fit
a <- solve(t(B) %*% B + lambda * t(D) %*% D) %*% t(B) %*% y[order(x)]
plot(x, y)
for (i in 1:ncol(B)) lines(x[order(x)], a[i] * B[, i], type = "l")
lines(x[order(x)], B %*% a, lwd = 2, type = "l")
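The role of λ in the penalized criterion can be isolated with an even simpler smoother: take the identity as the "basis" and keep only the second-order difference penalty (a Whittaker-type smoother; this Python sketch is a simplified stand-in for the fda example above, with made-up data):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x = np.linspace(0, 1, n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

# second-order difference penalty matrix D (shape (n-2, n))
D = np.diff(np.eye(n), 2, axis=0)

def smooth(lam):
    # minimize ||y - f||^2 + lam * ||D f||^2  ->  f = (I + lam D'D)^{-1} y
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

rough = lambda f: np.sum(np.diff(f, 2) ** 2)
# a larger lambda yields a visibly smoother (less rough) fit
print(rough(smooth(1.0)), rough(smooth(30.0)))
```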

Example: principal component analysis
Do you remember the Taylor expansion? Expanding f(x) at the point x = x_0,
$$f(x) = f(x_0) + f^{(1)}(x_0)(x - x_0) + \frac{f^{(2)}(x_0)}{2!}(x - x_0)^2 + \cdots$$

The leading terms describe the curve near x_0 (more important); the later terms describe behavior away from x_0 (less important).
Example: principal component analysis
Actually, there is a functional version: the Karhunen-Loève decomposition.
Consider a centered stochastic process X_t, t ∈ [a, b], satisfying a technical continuity condition and with covariance function K_X(s, t). Then X_t admits the decomposition
$$X_t = \sum_{k=1}^{\infty} Z_k e_k(t),$$
where the e_k are orthonormal eigenfunctions of the covariance operator with kernel K_X(s, t) and the Z_k are uncorrelated random variables whose variances are the corresponding eigenvalues.
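Numerically, the Karhunen-Loève decomposition amounts to an eigendecomposition of the discretized covariance matrix, which is exactly how it connects to PCA. A Python sketch with the Brownian-motion covariance K(s, t) = min(s, t), chosen purely as a standard example:

```python
import numpy as np

# discretize the Brownian-motion covariance K(s,t) = min(s,t) on [0,1]
m = 200
t = np.linspace(1 / m, 1, m)
K = np.minimum.outer(t, t)

# eigenfunctions of the covariance operator = eigenvectors of K
vals, vecs = np.linalg.eigh(K)
vals, vecs = vals[::-1], vecs[:, ::-1]   # sort descending

def recon(r):
    # truncated decomposition: K ~ sum_{k<=r} lambda_k e_k e_k^T
    return (vecs[:, :r] * vals[:r]) @ vecs[:, :r].T

err5  = np.linalg.norm(K - recon(5))  / np.linalg.norm(K)
err50 = np.linalg.norm(K - recon(50)) / np.linalg.norm(K)
print(err5, err50)   # a few leading components already capture most of K
```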
Smoothing in high dimension


Example: image reconstruction (Eigen basis)
- In this example, we use a tensor product approach with an eigenbasis to reconstruct a 658×718 image.
library(biOps)
# rescale a vector linearly to the grey-level range [0, 255]
std <- function(x){
  y <- matrix(c(0, 255), 2, 1)
  A <- cbind(range(x), 1)
  coefs <- solve(A, y)   # line mapping range(x) onto (0, 255)
  xmat <- cbind(x, 1)
  z <- round(xmat %*% coefs)
  z[z < 0] <- 0
  z[z > 255] <- 255
  z
}
y <- imgGreenBand(readJpeg("Ansh.jpg"))  # 658 x 718 grey-level matrix
s <- svd(y)
u <- s$u
v <- s$v
d <- s$d
# reconstruct the image from the leading numBasis eigen-images
f <- function(numBasis){
  b <- 0
  for (i in 1:numBasis) {
    b <- b + d[i] * as.matrix(u[, i]) %*% t(as.matrix(v[, i]))
  }
  b <- apply(b, 1, std)
  plot(imagedata(t(b)))
}
f(10)

f(50)

f(100)

The original picture:

- In this example, the eigenbasis is determined by the data, unlike user-defined bases such as the cubic spline basis, B-spline basis, and wavelet basis.
- With a user-defined basis, we only have to store the coefficients on disk rather than the entire image.
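The same low-rank reconstruction can be sketched without the biOps package (which is no longer on CRAN). In this Python sketch a smooth random field stands in for the photograph, since the original Ansh.jpg is not available here:

```python
import numpy as np

rng = np.random.default_rng(0)
# smooth-ish random field as a stand-in for the grey-level image
img = rng.normal(size=(120, 140)).cumsum(0).cumsum(1)

U, d, Vt = np.linalg.svd(img, full_matrices=False)

def recon(r):
    # keep the r leading eigen-images, as in f(numBasis) above
    return (U[:, :r] * d[:r]) @ Vt[:r]

# relative reconstruction error drops as more eigen-images are kept
err10 = np.linalg.norm(img - recon(10)) / np.linalg.norm(img)
err50 = np.linalg.norm(img - recon(50)) / np.linalg.norm(img)
print(err10, err50)
```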
Example: image reconstruction (B-spline basis)
library(fda)
# tensor-product B-spline reconstruction: y ~ B1 A B2'
g <- function(numBasis){
  bsp <- create.bspline.basis(rangeval = c(1, 658), norder = 3, nbasis = numBasis)
  B1 <- getbasismatrix(1:658, bsp, returnMatrix = TRUE)
  bsp <- create.bspline.basis(rangeval = c(1, 718), norder = 3, nbasis = numBasis)
  B2 <- getbasismatrix(1:718, bsp, returnMatrix = TRUE)
  # least squares coefficients, row direction then column direction
  D <- solve(t(B1) %*% B1) %*% t(B1) %*% y
  A <- D %*% B2 %*% solve(t(B2) %*% B2)
  est <- round(B1 %*% A %*% t(B2))
  dim(est) <- c(658, 718)
  est.img <- imagedata(apply(est, 1, std), type = "grey")
  plot(t(est.img))
}
g(10)

g(50)

g(100)

g(300)

Local polynomial approach

Suppose our goal is to estimate the true function m(x). Consider a local expansion around each data point x_j:
$$m(x) \approx m(x_j) + m'(x_j)(x - x_j) + \cdots + \frac{m^{(p)}(x_j)}{p!}(x - x_j)^p \approx \beta_0 + \beta_1 (x - x_j)$$
- Since we keep only the first two terms of the Taylor expansion, x cannot be too far from x_j.
- Now the parameter of interest is β_0, and we can minimize
$$\min_{\beta_0, \beta_1} \sum_i \left[y_i - \beta_0 - \beta_1 (x_i - x_j)\right]^2$$
- In this formulation, the estimate of m(x_j) still uses information from all x. To make the Taylor approximation accurate, we introduce a weight (a kernel function) that emphasizes local information:
$$\min_{\beta_0, \beta_1} \sum_i \left[y_i - \beta_0 - \beta_1 (x_i - x)\right]^2 K_h(x_i - x)$$
- Note that the kernel is a function of the bandwidth h, which must be prespecified.
- The bandwidth decides how much of the neighborhood is included. It is easy to see that the bandwidth plays a role similar to the roughness penalty in the spline approach.
Local polynomial approach
$$X = \begin{pmatrix} 1 & x_1 - x \\ \vdots & \vdots \\ 1 & x_n - x \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}, \quad W = \begin{pmatrix} K_h(x_1 - x) & & \\ & \ddots & \\ & & K_h(x_n - x) \end{pmatrix}$$
Then the problem reduces to weighted least squares,
$$\min_{\beta} (y - X\beta)^T W (y - X\beta), \qquad \hat{\beta} = (X^T W X)^{-1} X^T W y.$$
- Once we obtain β̂_0 at each evaluation point, connecting these values gives the estimate of the function m(x).
- Choice of kernel (not that important compared with the choice of bandwidth)
- Choice of bandwidth
- It is a long story... in simple cases, just choose it by eye.
- Some things we should know:
- Compared with the spline approach, deriving local theoretical properties is much easier for local polynomials.
- Things may go wrong at the boundary.
- We cannot use a smoothing function to extrapolate.
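The weighted least-squares estimator above is only a few lines of code. A Python sketch with a Gaussian kernel (the kernel choice is ours; the slides leave it open), which also illustrates the well-known property that a local linear fit reproduces a straight line exactly, for any bandwidth:

```python
import numpy as np

def local_linear(x0, x, y, h):
    """Local linear estimate m_hat(x0):
    beta_hat = (X'WX)^{-1} X'Wy, with m_hat(x0) = beta_hat[0]."""
    X = np.column_stack([np.ones_like(x), x - x0])
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)   # Gaussian kernel K_h(x_i - x0)
    XtW = X.T * w                            # X'W (W is diagonal)
    beta = np.linalg.solve(XtW @ X, XtW @ y)
    return beta[0]

x = np.linspace(0, 1, 50)
y = 2 * x + 1                 # noiseless linear truth
est = local_linear(0.37, x, y, h=0.1)
print(est)                    # recovers 2*0.37 + 1 = 1.74 exactly
```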
Curse of dimensionality
- When p = 1, for the axis x ∈ [1, 100] with an integer grid, we can fill the axis with 100 points.
- When p = 2, for axes x, y ∈ [1, 100] with an integer grid, we need 100^2 points to maintain the same density.
- When p = 3, for axes x, y, z ∈ [1, 100] with an integer grid, we need 100^3 points to maintain the same density.
In a high-dimensional problem, can you really find enough neighbors?
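To make this concrete: to capture a fraction r of the volume of the unit hypercube in p dimensions, a sub-cube must have edge length r^(1/p), so a "local" neighborhood quickly stops being local:

```python
# edge length of a sub-cube covering 10% of the unit cube's volume
for p in (1, 2, 10, 100):
    print(p, 0.10 ** (1 / p))
```

In 10 dimensions, a neighborhood containing 10% of uniformly spread points already spans about 79% of each axis; in 100 dimensions, about 98%.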
What are we going to do ?
y=f(X)+ϵ
- Unsupervised learning
- Unsupervised Learning (Ch10)
- Supervised learning (Inference purpose)
- Linear regression (Ch3)
- Classification (Ch4)
- Resampling method (Ch5)
- Linear model selection and regularization (Ch6)
- Moving beyond linearity (Ch7)
- Supervised learning (Prediction purpose)
- Tree based method (Ch8)
- Support Vector Machine (Ch9)